Abstract

The exploration-exploitation trade-off is among the central challenges of reinforcement
learning. The optimal Bayesian solution is intractable in general. This paper studies to what
extent analytic statements about optimal learning are possible if all beliefs are Gaussian
processes. A first-order approximation of learning of both loss and dynamics, for nonlinear,
time-varying systems in continuous time and space, subject to a relatively weak restriction on
the dynamics, is described by an infinite-dimensional partial differential equation. An
approximate finite-dimensional projection gives an impression of how this result may be helpful.

1 Introduction - Optimal Reinforcement Learning 

Reinforcement learning is about doing two things at once: Optimizing a function while learning
about it. These two objectives must be balanced: Ignorance precludes efficient optimization; time
spent hunting after irrelevant knowledge incurs unnecessary loss. This dilemma is famously known
as the exploration-exploitation trade-off. Classic reinforcement learning often considers time cheap;
the trade-off then plays a subordinate role to the desire for learning a "correct" model or policy. Many
classic reinforcement learning algorithms thus rely on ad-hoc methods to control exploration, such
as "ε-greedy" [1], or "Thompson sampling" [2]. However, at least since a thesis by Duff [3] it has
been known that Bayesian inference allows optimal balance between exploration and exploitation. It
requires integration over every possible future trajectory under the current belief about the system's
dynamics, all possible new data acquired along those trajectories, and their effect on decisions taken
along the way. This amounts to optimization and integration over a tree, of exponential cost in the
size of the state space [4]. The situation is particularly dire for continuous space-times, where both
depth and branching factor of the "tree" are uncountably infinite. Several authors have proposed
approximating this lookahead through samples [5, 6, 7, 8], or ad-hoc estimators that can be shown to
be in some sense close to the Bayes-optimal policy [9].

In a parallel development, recent work by Todorov [10], Kappen [11] and others introduced an idea to
reinforcement learning long commonplace in other areas of machine learning: Structural assumptions,
while restrictive, can greatly simplify inference problems. In particular, a recent paper by Simpkins
et al. [12] showed that it is actually possible to solve the exploration-exploitation trade-off locally,
by constructing a linear approximation using a Kalman filter. Simpkins and colleagues further
assumed the loss function to be known, and the dynamics to be known up to Brownian drift. Here, I use
their work as inspiration for a study of general optimal reinforcement learning of dynamics and loss
functions of an unknown, nonlinear, time-varying system (note that most reinforcement learning
algorithms are restricted to time-invariant systems). The core assumption is that all uncertain
variables are known up to Gaussian process uncertainty. The main result is a first-order description
of optimal reinforcement learning in the form of infinite-dimensional differential statements. This
kind of description opens up new approaches to reinforcement learning. As an initial example of such
treatments, Section 4 presents an approximate Ansatz that affords an explicit reinforcement learning
algorithm, tested in some simple but instructive experiments (Section 5).

An intuitive description of the paper's results is this: From the prior and a corresponding choice of
learning machinery (Section 2), we construct statements about the dynamics of the learning process
(Section 3). The learning machine itself provides a probabilistic description of the dynamics of the
physical system. Combining both dynamics yields a joint system, which we aim to control optimally.
Doing so amounts to simultaneously controlling exploration (controlling the learning system) and
exploitation (controlling the physical system).

Because large parts of the analysis rely on concepts from optimal control theory, this paper will use
notation from that field. Readers more familiar with the reinforcement learning literature may wish to
mentally replace coordinates x with states s, controls u with actions a, dynamics with transitions
p(s' | s, a) and utilities q with losses (negative rewards) -r. The latter is potentially confusing, so
note that optimal control in this paper will attempt to minimize values, rather than to maximize them,
as usual in reinforcement learning (these two descriptions are, of course, equivalent).



2 A Class of Learning Problems



We consider the task of optimally controlling an uncertain system whose states
$s = (x, t) \in \mathcal{K} = \mathbb{R}^D \times \mathbb{R}$ lie in a $D+1$ dimensional Euclidean
phase space-time: A cost $Q$ (cumulated loss) is acquired at $(x, t)$ with rate $dQ/dt = q(x, t)$,
and the first inference problem is to learn this analytic function $q$. A second, independent
learning problem concerns the dynamics of the system. We assume the dynamics separate into free and
controlled terms affine to the control:

$$ dx(t) = [f(x,t) + g(x,t)\,u(x,t)]\,dt \tag{1} $$

where $u(x, t)$ is the control function we seek to optimize, and $f$, $g$ are analytic functions. To
simplify our analysis, we will assume that either $f$ or $g$ is known, while the other may be
uncertain (or, alternatively, that it is possible to obtain independent samples from both functions).
See Section 3 for a note on how this assumption may be relaxed. W.l.o.g., let $f$ be uncertain and
$g$ known. Information about both $q(x, t)$ and $f(x, t) = [f_1, \dots, f_D]$ is acquired
stochastically: A Poisson process of constant rate $\lambda$ produces mutually independent samples

$$ y_q(x,t) = q(x,t) + \epsilon_q \quad\text{and}\quad y_{f_d}(x,t) = f_d(x,t) + \epsilon_{f_d} \quad\text{where}\quad \epsilon_q \sim \mathcal{N}(0, \sigma_q^2);\ \epsilon_{f_d} \sim \mathcal{N}(0, \sigma_{f_d}^2). \tag{2} $$

The noise levels $\sigma_q$ and $\sigma_f$ are presumed known. Let our initial beliefs about $q$ and
$f$ be given by Gaussian processes $\mathcal{GP}_{k_q}(q; \mu_q, \Sigma_q)$ and independent Gaussian
processes $\prod_{d}^{D} \mathcal{GP}_{k_{f_d}}(f_d; \mu_{f_d}, \Sigma_{f_d})$, respectively, with
kernels $k_q, k_{f_1}, \dots, k_{f_D}$ over $\mathcal{K}$, and mean / covariance functions
$\mu$ / $\Sigma$. In other words, samples over the belief can be drawn using an infinite vector
$\Omega$ of i.i.d. Gaussian variables, as

$$ \tilde f_d([x,t]) = \mu_{f_d}([x,t]) + \int \Sigma_{f_d}^{1/2}([x,t],[x',t'])\,\Omega(x',t')\,dx'\,dt' = \mu_{f_d}([x,t]) + (\Sigma_{f_d}^{1/2}\Omega)([x,t]) \tag{3} $$

The second equation demonstrates a compact notation for inner products that will be used throughout.
It is important to note that $f$, $q$ are unknown, but deterministic. At any point during learning,
we can use the same samples $\Omega$ to describe uncertainty, while $\mu$, $\Sigma$ change during the
learning process.

To ensure continuous trajectories, we also need to regularize the control. Following control custom,
we introduce a quadratic control cost $\rho(u) = \tfrac{1}{2} u^\top R^{-1} u$ with control cost
scaling matrix $R$. Its units $[R] = [x/t]/[Q/x]$ relate the cost of changing location to the utility
gained by doing so.

The overall task is to find the optimal discounted horizon value 



$$ v(x,t) = \min_u \int_t^{\infty} e^{-(\tau - t)/\gamma} \left[ q[\chi[\tau, u(\chi,\tau)], \tau] + \tfrac{1}{2}\, u(\chi,\tau)^\top R^{-1} u(\chi,\tau) \right] d\tau \tag{4} $$

where $\chi(\tau, u)$ is the trajectory generated by the dynamics defined in Equation (1), using the
control law (policy) $u(x, t)$. The exponential definition of the discount $\gamma > 0$ gives the unit
of time to $\gamma$.
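
As a small numerical illustration of the discount in Eq. (4), the following sketch integrates a
constant cost rate under the exponential discount. The function name `discounted_cost`, the
truncation of the infinite horizon, and the constant rate are assumptions made for this example
only. For a constant rate q0 the integral evaluates to γ·q0, which makes explicit that γ carries
units of time.

```python
import math

def discounted_cost(rate, gamma, t0=0.0, horizon_factor=40.0, steps=200_000):
    """Midpoint-rule approximation of the integral in Eq. (4) for a cost rate(tau).

    The infinite horizon is truncated at t0 + horizon_factor * gamma (an assumption
    of this sketch); the discarded tail is of relative size exp(-horizon_factor).
    """
    width = horizon_factor * gamma / steps
    total = 0.0
    for i in range(steps):
        tau = t0 + (i + 0.5) * width
        total += math.exp(-(tau - t0) / gamma) * rate(tau) * width
    return total

# Constant loss rate of 2 cost units per second, discount gamma = 5 s.
value = discounted_cost(lambda tau: 2.0, gamma=5.0)
```

Here `value` comes out close to γ·q0 = 10 cost units, up to discretisation and truncation error.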

Before beginning the analysis, consider the relative generality of this definition: We allow for a
continuous phase space. Both loss and dynamics may be uncertain, of rather general nonlinear form,
and may change over time. The specific choice of a Poisson process for the generation of samples is
somewhat ad-hoc, but some measure is required to quantify the flow of information through time.
The Poisson process is in some sense the simplest such measure, assigning uniform probability
density. An alternative is to assume that datapoints are acquired at regular intervals of width
$1/\lambda$. This results in a quite similar model but, since the system's dynamics still proceed in
continuous time, can complicate notation. A downside is that we had to restrict the form of the
dynamics. However, Eq. (1) still covers numerous physical systems studied in control, for example
many mechanical systems, from classics like cart-and-pole to realistic models for helicopters [13].
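
The problem class of this section can be simulated in a few lines. The sketch below is only an
illustration under stated assumptions: a one-dimensional state, forward-Euler integration of
Eq. (1), and a per-step Bernoulli approximation of the Poisson process (one sample arrives in a
step of length dt with probability λ·dt). The name `simulate` and the example functions are
hypothetical and not taken from the paper.

```python
import math
import random

def simulate(f, g, q, u, x0, lam, sigma_q, sigma_f, T, dt=1e-3, seed=0):
    """Integrate dx = [f + g u] dt and collect noisy samples of q and f (Eq. 2)."""
    rng = random.Random(seed)
    x, t = x0, 0.0
    observations = []
    for _ in range(int(round(T / dt))):
        # Poisson arrivals of rate lam, approximated by a Bernoulli draw per step.
        if rng.random() < lam * dt:
            y_q = q(x, t) + rng.gauss(0.0, sigma_q)
            y_f = f(x, t) + rng.gauss(0.0, sigma_f)
            observations.append((x, t, y_q, y_f))
        # Forward-Euler step of the control-affine dynamics, Eq. (1).
        x += (f(x, t) + g(x, t) * u(x, t)) * dt
        t += dt
    return x, observations

# Example: stable linear drift, unit control gain, no control input.
x_end, observations = simulate(
    f=lambda x, t: -x, g=lambda x, t: 1.0, q=lambda x, t: x * x,
    u=lambda x, t: 0.0, x0=1.0, lam=50.0, sigma_q=0.1, sigma_f=0.1, T=1.0)
```

With zero control the state decays roughly as exp(-t), and on average about λ·T samples are
collected along the trajectory.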



3 Optimal Control for the Learning Process 



The optimal solution to the exploration-exploitation trade-off is formed by the dual control [14] of a
joint representation of the physical system and the beliefs over it. In reinforcement learning, this idea
is known as a belief-augmented POMDP [3, 4], but is not usually construed as a control problem.
This section constructs the Hamilton-Jacobi-Bellman (HJB) equation of the joint control problem
for the system described in Sec. 2, and analytically solves the equation for the optimal control. This
necessitates a description of the learning algorithm's dynamics:

At time $t = \tau$, let the system be at phase space-time $s_\tau = (x(\tau), \tau)$ and have the
Gaussian process belief $\mathcal{GP}(q; \mu_\tau(s), \Sigma_\tau(s, s'))$ over the function $q$ (all
derivations in this section will focus on $q$, and we will drop the subscript $q$ from many quantities
for readability. The forms for $f$, or $g$, are entirely analogous, with independent Gaussian
processes for each dimension $d = 1, \dots, D$). This belief stems from a finite number $N$ of samples
$y_0 = [y_1, \dots, y_N]^\top \in \mathbb{R}^N$ collected at space-times
$S_0 = [(x_1, t_1), \dots, (x_N, t_N)]^\top = [s_1, \dots, s_N]^\top \in \mathcal{K}^N$ (note that
$t_1$ to $t_N$ need not be equally spaced, ordered, or $< \tau$). For arbitrary points
$s^* = (x^*, t^*) \in \mathcal{K}$, the belief over $q(s^*)$ is a Gaussian with mean function
$\mu_\tau$ and covariance function $\Sigma_\tau$ [15]

$$ \mu_\tau(s_i^*) = k(s_i^*, S_0)\,[K(S_0, S_0) + \sigma_q^2 I]^{-1} y_0 $$
$$ \Sigma_\tau(s_i^*, s_j^*) = k(s_i^*, s_j^*) - k(s_i^*, S_0)\,[K(S_0, S_0) + \sigma_q^2 I]^{-1} k(S_0, s_j^*) \tag{5} $$



where $K(S_0, S_0)$ is the Gram matrix with elements $K_{ab} = k(s_a, s_b)$. We will abbreviate
$K_0 = [K(S_0, S_0) + \sigma_q^2 I]$ from here on. The co-vector $k(s^*, S_0)$ has elements
$k_i = k(s^*, s_i)$ and will be shortened to $k_0$. How does this belief change as time moves from
$\tau$ to $\tau + dt$? If $dt \to 0$, the chance of acquiring a datapoint $y_\tau$ in this time is
$\lambda\,dt$. Marginalising over this Poisson stochasticity, we expect one sample with probability
$\lambda\,dt$, two samples with $(\lambda\,dt)^2$ and so on. So the mean after $dt$ is expected to be

$$ \mu_{\tau+dt} = \lambda\,dt\,(k_0, k_\tau) \begin{pmatrix} K_0 & \xi_\tau \\ \xi_\tau^\top & \kappa_\tau \end{pmatrix}^{-1} \begin{pmatrix} y_0 \\ y_\tau \end{pmatrix} + (1 - \lambda\,dt - O(\lambda\,dt)^2)\, k_0 K_0^{-1} y_0 + O(\lambda\,dt)^2 \tag{6} $$



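where we have defined the map $k_\tau = k(s^*, s_\tau)$, the vector $\xi_\tau$ with elements
$\xi_{\tau,i} = k(s_i, s_\tau)$, and the scalar $\kappa_\tau = k(s_\tau, s_\tau) + \sigma_q^2$.
Algebraic re-formulation yields

$$ \mu_{\tau+dt} = k_0 K_0^{-1} y_0 + \lambda\,(k_\tau - k_0 K_0^{-1}\xi_\tau)(\kappa_\tau - \xi_\tau^\top K_0^{-1}\xi_\tau)^{-1}(y_\tau - \xi_\tau^\top K_0^{-1} y_0)\,dt. \tag{7} $$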

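Note that $\xi_\tau^\top K_0^{-1} y_0 = \mu(s_\tau)$, the mean prediction at $s_\tau$, and
$(\kappa_\tau - \xi_\tau^\top K_0^{-1}\xi_\tau) = \sigma_q^2 + \Sigma(s_\tau, s_\tau)$,
the marginal variance there. Hence, we can define scalars $\bar\Sigma$, $\bar\sigma$ and write

$$ \frac{y_\tau - \xi_\tau^\top K_0^{-1} y_0}{(\kappa_\tau - \xi_\tau^\top K_0^{-1}\xi_\tau)^{1/2}} = \frac{[\Sigma^{1/2}\Omega](s_\tau) + \sigma_q\,\omega}{[\Sigma(s_\tau, s_\tau) + \sigma_q^2]^{1/2}} = \bar\Sigma_\tau^{1/2}\,\Omega + \bar\sigma_\tau\,\omega \quad\text{with}\quad \omega \sim \mathcal{N}(0, 1). \tag{8} $$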

So the change to the mean consists of a deterministic but uncertain change whose effects accumulate
linearly in time, and a stochastic change, caused by the independent noise process, whose variance
accumulates linearly in time (in truth, these two points are considerably subtler; a detailed proof is
left out for lack of space). We use the Wiener [16] measure $d\omega$ to write

$$ d\mu_\tau(s^*) = \lambda\, \frac{k_\tau - k_0 K_0^{-1}\xi_\tau}{(\kappa_\tau - \xi_\tau^\top K_0^{-1}\xi_\tau)^{1/2}} \cdot \frac{[\Sigma^{1/2}\Omega](s_\tau)\,dt + \sigma_q\,d\omega}{[\Sigma(s_\tau, s_\tau) + \sigma_q^2]^{1/2}} = \lambda\, L_{s_\tau}(s^*)\,[\bar\Sigma_\tau^{1/2}\,\Omega\,dt + \bar\sigma_\tau\,d\omega] \tag{9} $$

where we have implicitly defined the innovation function $L$. Note that $L$ is a function of both
$s^*$ and $s_\tau$. A similar argument finds the change of the covariance function to be the
deterministic rate

$$ d\Sigma_\tau(s_i^*, s_j^*) = -\lambda\, L_{s_\tau}(s_i^*)\, L_{s_\tau}^\top(s_j^*)\,dt. \tag{10} $$
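
The rates above are λ times the effect of a single new datum at the current location. The
following sketch checks that single-datum effect numerically: conditioning a Gaussian process on one
additional observation lowers the covariance by the outer product of the innovation function with
itself, and shifts the mean by the innovation function times the normalised residual. It is a minimal
illustration assuming scalar inputs in place of space-time points and a unit square-exponential
kernel; the helper names `k_se`, `solve` and `posterior` are hypothetical.

```python
import math

def k_se(a, b, theta=1.0, ell=1.0):
    """Square-exponential kernel on scalar inputs (stand-in for points s = (x, t))."""
    return theta**2 * math.exp(-0.5 * (a - b)**2 / ell**2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            fac = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= fac * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def posterior(S, y, noise_var):
    """Posterior mean and covariance functions in the form of Eq. (5)."""
    K = [[k_se(a, b) + (noise_var if i == j else 0.0) for j, b in enumerate(S)]
         for i, a in enumerate(S)]
    alpha = solve(K, y)
    def mean(s):
        return sum(k_se(s, si) * ai for si, ai in zip(S, alpha))
    def cov(s, sp):
        v = solve(K, [k_se(si, sp) for si in S])
        return k_se(s, sp) - sum(k_se(s, si) * vi for si, vi in zip(S, v))
    return mean, cov

S0, y0, noise_var = [-1.0, 0.3, 2.0], [0.5, -0.2, 1.0], 0.1
mean0, cov0 = posterior(S0, y0, noise_var)
s_tau, y_tau = 0.9, 0.7                      # one new datum at the current location
mean1, cov1 = posterior(S0 + [s_tau], y0 + [y_tau], noise_var)

def L(s):
    """Innovation function: prior covariance with s_tau over the marginal std there."""
    return cov0(s, s_tau) / math.sqrt(cov0(s_tau, s_tau) + noise_var)

a, b = 0.0, 1.5
cov_via_L = cov0(a, b) - L(a) * L(b)
residual = (y_tau - mean0(s_tau)) / math.sqrt(cov0(s_tau, s_tau) + noise_var)
mean_via_L = mean0(a) + L(a) * residual
```

Both `cov_via_L` and `mean_via_L` agree with the directly recomputed posterior (`cov1(a, b)` and
`mean1(a)`) up to floating-point error, which is the rank-one structure the rate equations rely on.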






So the dynamics of learning consist of a deterministic change to the covariance, and both deterministic
and stochastic changes to the mean, both of which are samples of a Gaussian process with covariance
function proportional to $L L^\top$. This separation is a fundamental characteristic of GPs (it is the
nonparametric version of a more straightforward notion for finite-dimensional Gaussian beliefs, for
data with known noise magnitude).

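We introduce the belief-augmented space $\mathcal{H}$ containing states
$z(\tau) = [x(\tau), \tau, \mu_q^\tau(s), \mu_{f_1}^\tau, \dots, \mu_{f_D}^\tau, \Sigma_q^\tau(s, s'), \Sigma_{f_1}^\tau, \dots, \Sigma_{f_D}^\tau]$.
Since the means and covariances are functions, $\mathcal{H}$ is infinite-dimensional.
Under our beliefs, $z(\tau)$ obeys a stochastic differential equation of the form

$$ dz = [A(z) + B(z)\,u + C(z)\,\Omega]\,dt + D(z)\,d\omega \tag{11} $$

with free dynamics $A$, controlled dynamics $Bu$, uncertainty operator $C$, and noise operator $D$

$$ A = [\mu_f^\tau(z_x, z_t),\ 1,\ 0, 0, \dots, 0,\ -\lambda L_q L_q^\top,\ -\lambda L_{f_1} L_{f_1}^\top,\ \dots,\ -\lambda L_{f_D} L_{f_D}^\top]; \tag{12} $$
$$ B = [g(s^*), 0, 0, 0, \dots]; \qquad C = \mathrm{diag}(\Sigma_{f\tau}^{1/2},\ 0,\ \lambda L_q \bar\Sigma_q^{1/2},\ \lambda L_{f_1} \bar\Sigma_{f_1}^{1/2},\ \dots,\ \lambda L_{f_D} \bar\Sigma_{f_D}^{1/2},\ 0, \dots, 0); $$
$$ D = \mathrm{diag}(0,\ 0,\ \lambda L_q \bar\sigma_q,\ \lambda L_{f_1} \bar\sigma_{f_1},\ \dots,\ \lambda L_{f_D} \bar\sigma_{f_D},\ 0, \dots, 0) \tag{13} $$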

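The value (the expected cost to go) of any state $s^*$ is given by the Hamilton-Jacobi-Bellman
equation, which follows from Bellman's principle and a first-order expansion, using Eq. (4):

$$
\begin{aligned}
v(z_\tau) &= \min_u \iint \Big[ \big( \mu_q(s_\tau) + \Sigma_{q\tau}^{1/2}\Omega_q + \sigma_q\,\omega_q + \tfrac{1}{2} u^\top R^{-1} u \big)\,dt + e^{-dt/\gamma}\, v(z_{\tau+dt}) \Big]\, d\omega\, d\Omega \\
&= \min_u \int \Big[ \mu_q^\tau + \Sigma_{q\tau}^{1/2}\Omega_q + \tfrac{1}{2} u^\top R^{-1} u + \frac{v(z_\tau)}{dt} - \frac{v(z_\tau)}{\gamma} + [A + Bu + C\Omega]^\top \nabla v + \tfrac{1}{2}\,\mathrm{tr}[D^\top (\nabla^2 v) D] \Big]\, d\Omega\, dt
\end{aligned} \tag{14}
$$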

Integration over $\omega$ can be performed with ease, and removes the stochasticity from the problem;
the uncertainty over $\Omega$ is a lot more challenging. Because the distribution over future losses is
correlated through space and time, $\nabla v$, $\nabla^2 v$ are functions of $\Omega$, and the integral
is nontrivial. But there are some obvious approximate approaches. For example, if we (inexactly) swap
integration and minimisation, draw samples $\Omega^i$ and solve for the value for each sample, we get
an "average optimal controller". This over-estimates the actual sum of future rewards by assuming the
controller has access to the true system. It has the potential advantage of considering the actual
optimal controller for every possible system, the disadvantage that the average of optima need not be
optimal for any actual solution. On the other hand, if we ignore the correlation between $\Omega$ and
$\nabla v$, we can integrate Eq. (14) locally, all terms in $\Omega$ drop out and we are left with an
"optimal average controller", which assumes that the system locally follows its average (mean)
dynamics. This cheaper strategy was adopted in the following. Note that it is myopic, but not greedy
in a simplistic sense: it does take the effect of learning into account. It amounts to a "global
one-step look-ahead". One could imagine extensions that consider the influence of $\Omega$ on
$\nabla v$ to a higher order, but these will be left for future work. Under this first-order
approximation, analytic minimisation over $u$ can be performed in closed form, and yields

$$ u(z) = -R\,B(z)^\top \nabla v(z) = -R\,g(x, t)^\top \nabla_x v(z). \tag{15} $$
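
The closed-form minimiser can be verified in two lines. Under the first-order approximation, the only
terms of the Bellman expansion that depend on $u$ are the quadratic control cost and the controlled
drift. The following worked step is a sketch that assumes $R$ is symmetric and positive definite (an
assumption not stated explicitly in the text); it sets the gradient of those terms to zero and then
evaluates them at the minimiser:

```latex
\begin{align*}
\frac{\partial}{\partial u}\Big[\tfrac{1}{2}\, u^\top R^{-1} u + (B u)^\top \nabla v\Big]
  &= R^{-1} u + B^\top \nabla v = 0
  \quad\Longrightarrow\quad u = -R\,B^\top \nabla v, \\
\Big[\tfrac{1}{2}\, u^\top R^{-1} u + (B u)^\top \nabla v\Big]_{u = -R B^\top \nabla v}
  &= \tfrac{1}{2}\,[\nabla v]^\top B R B^\top \nabla v - [\nabla v]^\top B R B^\top \nabla v
   = -\tfrac{1}{2}\,[\nabla v]^\top B R B^\top \nabla v .
\end{align*}
```

Because $B$ only has a phase-space component $g$, the product $B^\top \nabla v$ reduces to
$g^\top \nabla_x v$. The second line is the negative quadratic term that enters the optimal
Hamilton-Jacobi-Bellman equation as the benefit of control.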

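The optimal Hamilton-Jacobi-Bellman equation is then

$$ \gamma^{-1} v(z) = \mu_q^\tau + A^\top \nabla v - \tfrac{1}{2}\,[\nabla v]^\top B R B^\top \nabla v + \tfrac{1}{2}\,\mathrm{tr}[D^\top (\nabla^2 v) D]. \tag{16} $$

A more explicit form emerges upon re-inserting the definitions of Eq. (12) into Eq. (16):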

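$$
\begin{aligned}
\gamma^{-1} v(z) ={}& \mu_q^\tau + \underbrace{[\mu_f^\tau(z_x, z_t)^\top \nabla_x + \nabla_t]\, v(z)}_{\text{free drift cost}} \\
& - \underbrace{\tfrac{1}{2}\,[\nabla_x v(z)]^\top g(z_x, z_t)\, R\, g(z_x, z_t)^\top \nabla_x v(z)}_{\text{control benefit}} \\
& + \sum_{c = q, f_1, \dots, f_D} \Big( \underbrace{-\lambda\,[L_c L_c^\top \nabla_{\Sigma_c}]\, v(z)}_{\text{exploration bonus}} + \underbrace{\tfrac{1}{2}\,\lambda^2 \bar\sigma_c^2\,[L_c^\top (\nabla_{\mu_c}^2 v(z))\, L_c]}_{\text{diffusion cost}} \Big)
\end{aligned} \tag{17}
$$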

Equation (17) is the central result: Given Gaussian priors on nonlinear control-affine dynamic
systems, up to a first order approximation, optimal reinforcement learning is described by an
infinite-dimensional second-order partial differential equation. It can be interpreted as follows
(labels in the equation, note the negative signs of "beneficial" terms): The value of a state
comprises the immediate utility rate; the effect of the free drift through space-time and the benefit
of optimal control; an exploration bonus of learning, and a diffusion cost engendered by the
measurement noise. The first two lines of the right hand side describe effects from the phase
space-time subspace of the augmented space, while the last line describes effects from the belief part
of the augmented space. The former will be called exploitation terms, the latter exploration terms,
for the following reason: If the first two lines dominate the right hand side of Equation (17) in
absolute size, then future losses are governed by the physical sub-space, caused by exploiting
knowledge to control the physical system. On the other hand, if the last line dominates the value
function, exploration is more important than exploitation: the algorithm controls the physical space
to increase knowledge. To my knowledge, this is the first differential statement about reinforcement
learning's two objectives. Finally, note the role of the sampling rate $\lambda$: If $\lambda$ is very
low, exploration is useless over the discount horizon.

Even after these approximations, solving Equation (17) for $v$ remains nontrivial for two reasons:
First, although the vector product notation is pleasingly compact, the mean and covariance functions
are of course infinite-dimensional, and what looks like straightforward inner vector products are in
fact integrals. For example, the average exploration bonus for the loss, writ large, reads

$$ -\lambda\,[L_q L_q^\top \nabla_{\Sigma_q}]\, v(z) = -\lambda \iint_{\mathcal{K}} L_{s_\tau}^{(q)}(s_i^*)\, L_{s_\tau}^{(q)}(s_j^*)\, \frac{\partial v(z)}{\partial \Sigma_q(s_i^*, s_j^*)}\, ds_i^*\, ds_j^* \tag{18} $$

(note that this object remains a function of the state $s_\tau$). For general kernels $k$, these
integrals may only be solved numerically. However, for at least one specific choice of kernel
(square-exponentials) and parametric Ansatz, the required integrals can be solved in closed form. This
analytic structure is so interesting, and the square-exponential kernel so widely used, that the
"numerical" part of the paper (Section 4) will restrict the choice of kernel to this class.
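
To give a flavour of why square-exponential kernels make such integrals tractable, the sketch below
compares a closed-form Gaussian overlap integral with a numerical quadrature. It is a simplified
one-dimensional illustration, not one of the integrals actually required by Eq. (17): both kernels
are assumed to share a single scalar length scale `ell`, and the function names are hypothetical.

```python
import math

def k_se(a, b, theta, ell):
    """One-dimensional square-exponential kernel theta^2 exp(-(a-b)^2 / (2 ell^2))."""
    return theta**2 * math.exp(-0.5 * (a - b)**2 / ell**2)

def overlap_closed_form(a, b, theta_a, theta_b, ell):
    """Integral over s of k(s, a) k(s, b) for two SE kernels sharing one length scale."""
    return ((theta_a * theta_b)**2 * math.sqrt(math.pi) * ell
            * math.exp(-0.25 * (a - b)**2 / ell**2))

def overlap_numeric(a, b, theta_a, theta_b, ell, steps=20_000):
    """Midpoint-rule check of the same integral on a window of +-10 length scales."""
    lo = min(a, b) - 10.0 * ell
    hi = max(a, b) + 10.0 * ell
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        s = lo + (i + 0.5) * h
        total += k_se(s, a, theta_a, ell) * k_se(s, b, theta_b, ell) * h
    return total

closed = overlap_closed_form(0.3, 1.1, 1.0, 2.0, 0.7)
numeric = overlap_numeric(0.3, 1.1, 1.0, 2.0, 0.7)
```

The closed form follows from completing the square in the exponent: the product of two Gaussians
in `s` is again a Gaussian in `s`, scaled by a Gaussian in the distance between the two centres.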

The other problem, of course, is that Equation (17) is a nontrivial differential equation. Section 4
presents one initial attempt at a numerical solution that should not be mistaken for a definitive
answer. Despite all this, Eq. (17) arguably constitutes a useful gain for Bayesian reinforcement
learning: It replaces the intractable definition of the value in terms of future trajectories with a
differential equation. This raises hope for new approaches to reinforcement learning, based on
numerical analysis rather than sampling.

Digression: Relaxing Some Assumptions 

This paper only applies to the specific problem class of Section 2. Any generalisations and extensions
are future work, and I do not claim to solve them. But it is instructive to consider some easier
extensions, and some harder ones: For example, it is intractable to simultaneously learn both $g$ and
$f$ nonparametrically, if only the actual transitions are observed, because the beliefs over the two
functions become infinitely dependent when conditioned on data. But if the belief on either $g$ or $f$
is parametric (e.g. a general linear model), a joint belief on $g$ and $f$ is tractable [see 15, §2.7],
in fact straightforward. Both the quadratic control cost $\propto u^\top R^{-1} u$ and the
control-affine form ($g(x, t)u$) are relaxable assumptions: other parametric forms are possible, as
long as they allow for analytic optimization of Eq. (14). On the question of learning the kernels for
Gaussian process regression on $q$ and $f$ or $g$, it is clear that standard ways of inferring kernels
[15, 17] can be used without complication, but that they are not covered by the notion of optimal
learning as addressed here.

4 Numerically Solving the Hamilton-Jacobi-Bellman Equation

Solving Equation (16) is principally a problem of numerical analysis, and a battery of numerical
methods may be considered. This section reports on one specific Ansatz, a Galerkin-type
projection analogous to the one used in [12]. For this we break with the generality of previous
sections and assume that the kernels $k$ are given by square exponentials
$k(a, b) = k_{SE}(a, b; \theta, S) = \theta^2 \exp(-\tfrac{1}{2}(a - b)^\top S^{-1}(a - b))$ with
parameters $\theta$, $S$. As discussed above, we approximate by setting $\Omega = 0$. We find an
approximate solution through a factorizing parametric Ansatz: Let the value of any point
$z \in \mathcal{H}$ in the belief space be given through a set of parameters $w$ and some nonlinear
functionals $\phi$, such that their contributions separate over phase space, mean, and covariance
functions:

$$ v(z) = \sum_{e = x, \Sigma_q, \mu_q, \Sigma_f, \mu_f} \phi_e(z_e)^\top w_e \quad\text{with one real weight vector } w_e \text{ per term.} \tag{19} $$

This projection is obviously restrictive, but it should be compared to the use of radial basis functions
for function approximation, a similarly restrictive framework widely used in reinforcement learning.
The functionals $\phi$ have to be chosen conducive to the form of Eq. (17). For square exponential
kernels, one convenient choice is

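$$ \phi_x^a(z_x) = k(s_z, s_a; \theta_a, S_a) \tag{20} $$
$$ \phi_\Sigma^b(z_\Sigma) = \iint_{\mathcal{K}} [\Sigma_z(s_i^*, s_j^*) - k(s_i^*, s_j^*)]\, k(s_i^*, s_b; \theta_b, S_b)\, k(s_j^*, s_b; \theta_b, S_b)\, ds_i^*\, ds_j^* \quad\text{and} \tag{21} $$
$$ \phi_\mu^c(z_\mu) = \iint_{\mathcal{K}} \mu_z(s_i^*)\, \mu_z(s_j^*)\, k(s_i^*, s_c; \theta_c, S_c)\, k(s_j^*, s_c; \theta_c, S_c)\, ds_i^*\, ds_j^* \tag{22} $$

(the subtracted term in the first integral serves only numerical purposes). With this choice, the
integrals of Equation (17) can be solved analytically (solutions left out due to space constraints).
The approximate Ansatz turns Eq. (17) into an algebraic equation quadratic in $w_x$, linear in all
other $w_e$: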

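$$ \tfrac{1}{2}\, w_x^\top \Psi(z_x)\, w_x - q(z_x) + \sum_e \Xi^e(z_e)\, w_e = 0 \tag{23} $$

using co-vectors $\Xi$ and a matrix $\Psi$ with elements

$$ \Psi(z)_{k\ell} = [\nabla_x \phi_x^k(z)]^\top g(z_x)\, R\, g(z_x)^\top [\nabla_x \phi_x^\ell(z)] $$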

Note that the co-vectors $\Xi$ of the mean and covariance terms are both functions of the physical
state, through $s_\tau$. It is through this functional dependency that the value of information is
associated with the physical phase space-time. To solve for $w$, we simply choose a number of
evaluation points $z_{eval}$ sufficient to constrain the resulting system of quadratic equations, and
then find the least-squares solution $w_{opt}$ by function minimisation, using standard methods, such
as Levenberg-Marquardt [18]. A disadvantage of this approach is that it has a number of degrees of
freedom $\Theta$, such as the kernel parameters, and the number and locations $x_a$ of the feature
functionals. Our experiments (Section 5) suggest that it is nevertheless possible to get interesting
results simply by choosing these parameters heuristically.
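
The least-squares step can be illustrated on a toy problem with the same algebraic structure: each
evaluation point contributes one residual that is quadratic in one weight and linear in another. The
sketch below is a hand-written, minimal Levenberg-Marquardt variant for two scalar weights, not the
implementation referenced in [18]; the coefficient functions, the evaluation points and all names are
hypothetical stand-ins for the quantities of Eq. (23).

```python
def residuals(w, points):
    """One residual per evaluation point: 0.5*psi*w_x^2 + xi_x*w_x + xi_b*w_b - q."""
    w_x, w_b = w
    return [0.5 * psi * w_x**2 + xi_x * w_x + xi_b * w_b - q
            for (psi, xi_x, xi_b, q) in points]

def jacobian(w, points):
    """Derivatives of each residual with respect to (w_x, w_b)."""
    w_x, _ = w
    return [(psi * w_x + xi_x, xi_b) for (psi, xi_x, xi_b, _) in points]

def levenberg_marquardt(w, points, damping=1e-3, iterations=50):
    cost = sum(r * r for r in residuals(w, points))
    for _ in range(iterations):
        r = residuals(w, points)
        J = jacobian(w, points)
        # Damped normal equations (J^T J + damping * I) delta = -J^T r, 2x2 case.
        a = sum(j[0] * j[0] for j in J) + damping
        b = sum(j[0] * j[1] for j in J)
        d = sum(j[1] * j[1] for j in J) + damping
        g0 = -sum(j[0] * ri for j, ri in zip(J, r))
        g1 = -sum(j[1] * ri for j, ri in zip(J, r))
        det = a * d - b * b
        delta = ((d * g0 - b * g1) / det, (a * g1 - b * g0) / det)
        trial = (w[0] + delta[0], w[1] + delta[1])
        trial_cost = sum(ri * ri for ri in residuals(trial, points))
        if trial_cost < cost:      # accept the step and trust the model more
            w, cost, damping = trial, trial_cost, damping / 10.0
        else:                      # reject the step and damp more strongly
            damping *= 10.0
    return w, cost

# Synthetic evaluation points generated from known weights (illustration only).
w_true = (1.0, -0.5)
points = []
for i in range(9):
    z = i / 4.0
    psi, xi_x, xi_b = 0.2, 1.0 + z, 1.0 - 0.5 * z
    q = 0.5 * psi * w_true[0]**2 + xi_x * w_true[0] + xi_b * w_true[1]
    points.append((psi, xi_x, xi_b, q))

w_opt, final_cost = levenberg_marquardt((0.0, 0.0), points)
```

Using more evaluation points than unknowns, as in this toy setup, mirrors the over-constrained system
described above; the damping parameter interpolates between Gauss-Newton and gradient-descent steps.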



5 Experiments 

5.1 Illustrative Experiment on an Artificial Environment 

As a simple example system with a one-dimensional state space, $f$, $q$ were sampled from the model
described in Section 2, and $g$ set to the unit function. The state space was tiled regularly, in a
bounded region, with 231 square exponential ("radial") basis functions (Equation 20), initially all
with weight $w_x^i = 0$. For the information terms, only a single basis function was used for each
term (i.e. one single $\phi_{\Sigma_q}$, one single $\phi_{\mu_q}$, and equally for $f$, all with very
large length scales $S$, covering the entire region of interest). As pointed out above, this does not
imply a trivial structure for these terms, because of the functional dependency on $L_{s_\tau}$. Five
times the number of parameters, i.e. $N_{eval} = 1175$ evaluation points $z_{eval}$, were sampled, at
each time step, uniformly over the same region. It is not intuitively clear whether each $z_{eval}$
should have its own belief (i.e. whether the points must cover the belief space as well as the phase
space), but anecdotal evidence from the experiments suggests that it suffices to use the current
beliefs for all evaluation points. A more comprehensive evaluation of such aspects will be the subject
of a future paper. The discount factor was set to γ = 50s, the sampling rate at λ = 2/s, the control
cost at 10 m²/($s). Value and optimal control were evaluated at time steps of δt = 1/λ = 0.5s.

Figure 1 shows the situation 50s after initialisation. The most noteworthy aspect is the nontrivial
structure of exploration and exploitation terms. Despite the simplistic parameterisation of the
corresponding functionals, their functional dependence on $s_\tau$ induces a complex shape. The system
constantly balances exploration and exploitation, and the optimal balance depends nontrivially on
location, time, and the actual value (as opposed to only uncertainty) of accumulated knowledge. This
is an important insight that casts doubt on the usefulness of simple, local exploration bonuses, used
in many reinforcement learning algorithms.

Figure 1: State after 50 time steps, plotted over phase space-time. Top left: $\mu_q$ (blue is good).
The belief over $f$ is not shown, but has similar structure. Top right: value estimate $v$ at current
belief; compare to next two panels to note that the approximation is relatively coarse. Bottom left:
exploration terms. Bottom right: exploitation terms. At its current state (black diamond), the system
is in the process of switching from exploitation to exploration (blue region in bottom right panel is
roughly cancelled by red, forward cone in bottom left one).

Secondly, note that the system's trajectory does not necessarily follow what would be the optimal
path under full information. The value estimate reflects this, by assigning low (good) value to regions
behind the system's trajectory. This amounts to a sense of "remorse": Had the learner known about
these regions earlier, it would have striven to reach them. But this is not a sign of
sub-optimality: Remember that the value is defined on the augmented space. The plots in Figure 1
are merely a slice through that space at some level set in the belief space.



5.2 Comparative Experiment - The Furuta Pendulum 

The cart-and-pole system is an under-actuated problem widely studied in reinforcement learning. For
variation, this experiment uses a cylindrical version, the pendulum on the rotating arm [19]. The
task is to swing up the pendulum from the lower resting point. The table in Figure 2 compares the
average loss of a controller with access to the true $f$, $g$, $q$, but otherwise using Algorithm 1,
to that of an ε-greedy TD(λ) learner with linear function approximation, Simpkins et al.'s [12] Kalman
method and the Gaussian process learning controller. The linear function approximation of
TD(λ) used the same radial basis functions as the three other methods. None of these methods is free
of assumptions: Note that the sampling frequency influences TD in nontrivial ways rarely studied
(for example through the coarseness of the ε-greedy policy). The parameters were set to γ = 5s,
λ = 50/s. Note that reinforcement learning experiments often quote total accumulated loss, which
differs from the discounted task posed to the learner. Figure 2 reports actual discounted losses. The
GP method clearly outperforms the other two learners, which barely explore. Interestingly, none of
the tested methods, not even the informed controller, achieves a stable controlled balance, although
the GP learner does swing up the pendulum. This is due to the random, non-optimal location of basis
functions, which means resolution is not necessarily available where it is needed (in regions of high
curvature of the value function), and demonstrates a need for better solution methods for Eq. (17).

Method                              cumulative loss
Full Information (baseline)         4.4 ± 0.3
TD(λ)                               6.401 ± 0.001
Kalman filter Optimal Learner       6.408 ± 0.001
Gaussian process optimal learner    4.6 ± 1.4

Figure 2: The Furuta pendulum system: A pendulum of length $\ell_2$ is attached to a rotatable arm of
length $\ell_1$. The control input is the torque applied to the arm. Right: cost to go achieved by
different methods. Lower is better. Error measures are one standard deviation over five experiments.

There is of course a large number of other algorithms to potentially compare to, and these
results are anything but exhaustive. They should not be misunderstood as a critique of any other
method. But they highlight the need for units of measure on every quantity, and show how hard
optimal exploration and exploitation truly is. Note that, for time-varying or discounted problems,
there is no "conservative" option that could be adopted in place of the Bayesian answer.

6 Conclusion 

Gaussian process priors provide a nontrivial class of reinforcement learning problems for which 
optimal reinforcement learning reduces to solving differential equations. Of course, this fact alone 
does not make the problem easier, as solving nonlinear differential equations is in general intractable. 
However, the ubiquity of differential descriptions in other fields raises hope that this insight opens 
new approaches to reinforcement learning. For intuition on how such solutions might work, one 
specific approximation was presented, using functionals to reduce the problem to finite least-squares 
parameter estimation. 

The critical reader will have noted how central the prior is for the arguments in Section 3: The 
dynamics of the learning process are predictions of future data, thus inherently determined exclusively
by prior assumptions. One may find this unappealing, but there is no escape from it. Minimizing 
future loss requires predicting future loss, and predictions are always in danger of falling victim to 
incorrect assumptions. A finite initial identification phase may mitigate this problem by replacing 
prior with posterior uncertainty - but even then, predictions and decisions will depend on the model. 

The results of this paper raise new questions, theoretical and applied. The most pressing questions 
concern better solution methods for Eq. (14), in particular better means for taking the expectation 
over the uncertain dynamics to more than first order. There are also obvious probabilistic issues: Are 
there other classes of priors that allow similar treatments? (Note some conceptual similarities between 
this work and the BEETLE algorithm [4]). To what extent can approximate inference methods - 
widely studied in combination with Gaussian process regression - be used to broaden the utility of 
these results?